8  Zooarchaeology

Page under construction

8.1 Case studies

The following map shows the sites under investigation, divided by chronology. Please select the desired chronology (or chronologies) from the legend on the right.

Legend: R = Roman, LR = Late Roman, EMA = Early Middle Ages, Ma = 11th c. onwards

8.2 Medians

The faunal dataset is large (434+ records) and diversified. Looking at the distributions of each animal, the curve is not gaussian. The best choice for non-normal curves is to use medians instead of means to come up with figures that are less dependent on outliers. The function Medians_Chrono_Zoo() (Section 3.1) can be used to return as output weighted medians for each chronology. The in-depth description of how weights are calculated for each sample can be found in Section 6.4.1. To summarise, sites with a very large (i.e. fuzzy) chronology contribute less to the calculation of the median. Table 8.1 provides the median values of the main categories of faunal remains for each chronology, and Table 8.2 the median values for each century. Stronger colours in the cells indicate higher values.

Show the code
Medians_Categorised_per_Chronology_ZOO <- 
  data.frame(
    Medians_Chrono_Zoo(zooarch_cond, "R")*100,
    Medians_Chrono_Zoo(zooarch_cond, "LR")*100,
    Medians_Chrono_Zoo(zooarch_cond, "EMA")*100,
    Medians_Chrono_Zoo(zooarch_cond, "Ma")*100
  )

# Round to 2 digits
Medians_Categorised_per_Chronology_ZOO <- round(Medians_Categorised_per_Chronology_ZOO, 2)

## Weighted medians per century ##
Medians_ZOO_Centuries <- data.frame(
  "I BCE" = zooarch_tables(zooarch_cond, -1)$Medians,  
  "I CE" = zooarch_tables(zooarch_cond, 1)$Medians,
  "II CE" = zooarch_tables(zooarch_cond, 2)$Medians,
  "III CE" = zooarch_tables(zooarch_cond, 3)$Medians,
  "IV CE" = zooarch_tables(zooarch_cond, 4)$Medians,
  "V CE" = zooarch_tables(zooarch_cond, 5)$Medians,
  "VI CE" = zooarch_tables(zooarch_cond, 6)$Medians,
  "VII CE" = zooarch_tables(zooarch_cond, 7)$Medians,
  "VIII CE" = zooarch_tables(zooarch_cond, 8)$Medians,
  "IX CE" = zooarch_tables(zooarch_cond, 9)$Medians,
  "X CE" = zooarch_tables(zooarch_cond, 10)$Medians,
  "XI CE" = zooarch_tables(zooarch_cond, 11)$Medians
)

# Assigning the colnames (optional - instead of roman numerals)
colnames(Medians_ZOO_Centuries) <- c("1st c. BCE", "1st c. CE", "2nd c.", "3rd c.", "4th c.", "5th c.", "6th c.", "7th c.", "8th c.", "9th c.", "10th c.", "11th c.")

# Rounding the medians
Medians_ZOO_Centuries <- round(Medians_ZOO_Centuries, digits=2)

# Removing categories that are not necessary
Medians_ZOO_Centuries <- Medians_ZOO_Centuries[-c(6:9),]
Table 8.1: Weighted medians of zooarchaeological remains, divided by chronology.
Chronologies
R LR EMA Ma
Pigs 49.25 38.00 35.00 35.62
Cattle 10.00 8.00 18.00 19.00
Caprine 23.00 22.00 29.00 27.00
Dom..Fowl 4.00 5.83 5.00 5.00
Edible.W..Mammals 5.00 3.00 2.98 3.00
Fish 1.00 2.00 3.93 1.00
Mollusca 10.00 8.00 4.00 2.89
Unedible.Dom..Mammals 2.00 3.00 4.00 2.00
Unedible.Wild.Mammals 1.00 1.00 1.00 1.00

Pigs’ medians from the Italian peninsula are the highest in each chronology, although their values decrease after the Roman age peak. Cattle medians slightly decrease after the Roman age, even though surprisingly (put a reference here to literature review to explain why surprisingly) the values increase again (18–19.71%) during the early Medieval and Medieval age. The trends for sheeps and goats are also interesting. During the Roman age the Italian median is 25%, slightly decreasing in the 3rd to the 5th century, and increasing again after. When discussing sheep-farming, one must always consider the geographical features from which the data is being collected. This will be discussed later on in the chapter, where more regional and geographical trends will be provided. Domestic fowl (chickens and geese) has quite stable values of 4-5%, with a peak of 7.68% in the 11th century. Wild game peaks during the Roman age, with a median value of 5%, reaching a minimum in the early Middle ages (2%) and rising again in the 11th century. Two considerations must be made for game consumption. The first is that as we will see later on, game consumption is strongly related to the site typology. Secondly, the Roman age value is pulled up by assemblages from the 1st century BCE. After that, the values strongly decrease and by looking at the individual centuries the medians from the 7th century onwards are much higher (ranging from 1.42% to 2.09%).

Table 8.2: Weighted medians of zooarchaeological remains, divided by century.
Faunal remains
Pigs Cattle Caprine Domestic fowl Wild game
1st c. BCE 41.61 10.32 21.18 0.00 1.85
1st c. CE 39.96 11.00 25.27 0.00 0.63
2nd c. 47.06 10.16 20.72 0.79 0.21
3rd c. 38.69 8.04 18.77 1.94 0.81
4th c. 33.52 8.61 23.18 1.93 1.00
5th c. 33.68 14.24 24.89 2.52 0.91
6th c. 32.60 20.66 28.44 2.62 0.56
7th c. 30.42 18.52 30.14 4.51 1.58
8th c. 33.63 13.89 30.13 2.70 1.18
9th c. 37.54 11.76 23.32 1.16 1.54
10th c. 35.77 14.36 25.64 1.63 1.93
11th c. 34.08 19.34 27.44 1.91 2.09
* The color gradients in this table are used to indicate the chronologies.

8.2.1 Medians of faunal remains by context type

The weighted medians included below have been generated using the package dplyr and the summarize() function, applied to the exported relative proportion table (using the custom function zooarch_tables_general(), described in Section 3.2). The medians have been calculated for four animal categories (pigs, cattle, caprine, and game) for each site type and chronology. After, similar context types have been merged to simplify the reading; for example, the category Castle has been merged with the category Castrum, as they both indicate élite/military fortified contexts.

(a) Pigs

(b) Cattle

(c) Caprine

(d) Game

Figure 8.1: Medians (%) of edible animal remains, divided by site type and chronology.

8.2.2 Medians of faunal remains by macro region

The process for generating weighted medians for the three Italian macro regions (Southern, Central and Northern Italy) has followed the same logic used in the previous section. The medians have been calculated for four animal categories (pigs, cattle, caprine, and game) for each macro area and chronology.

(a) Pigs

(b) Cattle

(c) Caprine

(d) Game

Figure 8.2: Medians (%) of edible animal remains, plotted by macroregion and chronology.

8.2.3 Medians of faunal remains by geography type

Weighted medians have been generated for the four geographies considered (plain, hill, hilltop, coast, and mountain), after the categories Hill and Hilltop have been merged. The medians have been calculated for four animal categories (pigs, cattle, caprine, and game) for each geography and chronology.

(a) Pigs

(b) Cattle

(c) Caprine

(d) Game

Figure 8.3: Medians (%) of edible animal remains, plotted by geography and chronology.

8.2.4 Caprine vs altitude

8.3 Future work to do

1. Animal sizes must be considered:

  • find a way

  • at least cattle

  • at least in the south

2. Feature selection models: Which feature is most discriminating the dataset?

  • E.g. Is the geographical context (Plain, Hill, …) more important than Southern vs Northern Italy?

  • It might be worth it, if data allows it, to subset the geographical features of Southern Italy or by main type (Rural vs Urban / Elite vs non-Elite)

  • Maybe soil type is not so useful for animals? Ask someone

  • What causes a different distribution of cattle in Italy in each chronology?

  • What are the factors that influence the most the different distribution of domestic animals?

8.4 Test 1: RDA + Anova

The RDA is performed on the condensed zooarchaeological export from the database, using the vegan package. The absolute counts are converted to frequencies (using the function decostand()).

The density of each animal % is shown in the plots below.

After removing the column Chronology, which caused the repetition of some of the rows (samples spanning two chronologies), the function str() is called to show the structure of the dataset.

'data.frame':   347 obs. of  9 variables:
 $ Pigs             : num  4.78 34.57 35.71 48.72 34.42 ...
 $ Cattle           : num  93.49 34.57 22.22 9.36 8.96 ...
 $ Caprine          : num  1.63 23.46 6.35 24.43 28.72 ...
 $ Edible.W..Mammals: num  0 1.23 26.98 4.13 2.85 ...
 $ Type             : Factor w/ 4 levels "Fortified","Rural",..: 4 4 4 1 1 2 4 4 4 4 ...
 $ Chronology       : chr  "LR" "LR" "LR" "LR" ...
 $ Geo              : Factor w/ 4 levels "Coast","Hill",..: 4 2 4 3 3 4 4 4 4 4 ...
 $ Altitude         : int  29 123 330 615 224 96 55 172 172 172 ...
 $ Macroregion      : Factor w/ 3 levels "Central Italy",..: 2 2 2 2 2 2 2 2 2 2 ...
Call: rda(formula = data_test.com ~ Type + Chronology + Geo + Altitude
+ Macroregion, data = data_test.var)

                Inertia Proportion Rank
Total         1029.9921     1.0000     
Constrained    212.4942     0.2063    4
Unconstrained  817.4979     0.7937    4
Inertia is variance 

Eigenvalues for constrained axes:
  RDA1   RDA2   RDA3   RDA4 
160.89  37.89  11.35   2.36 

Eigenvalues for unconstrained axes:
  PC1   PC2   PC3   PC4 
442.5 256.2  93.5  25.2 

$r.squared
[1] 0.2063067

$adj.r.squared
[1] 0.1777907
Permutation test for rda under reduced model
Marginal effects of terms
Permutation: free
Number of permutations: 999

Model: rda(formula = data_test.com ~ Type + Chronology + Geo + Altitude + Macroregion, data = data_test.var)
             Df Variance       F Pr(>F)    
Type          3    60.72  8.2689  0.001 ***
Chronology    3    14.19  1.9331  0.043 *  
Geo           3    12.63  1.7194  0.098 .  
Altitude      1     2.35  0.9617  0.375    
Macroregion   2    91.72 18.7372  0.001 ***
Residual    334   817.50                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.5 Test 1: Random forest

Move this part to Methods

EXPAND! Unpack!

A random forest is a machine learning algorithm that is used for classification and regression tasks. It is called a “forest” because it is made up of many individual decision trees. Each tree in the forest makes a prediction, and the overall prediction is made by combining the predictions of all of the individual trees. The idea behind a random forest is that, by combining the predictions of many individual trees, the overall prediction will be more accurate than if any single tree were used on its own. The “random” part of the name comes from the fact that the trees in a random forest are built using a technique called random sampling, which helps to reduce the overfitting that can occur when training a single decision tree.

In this study, a random forest is used to identify the key variables that influence the farmers’ choices of domestic animals in the first millennium CE. The focus is on pigs, cattle, sheep/goats, and edible wild mammals. Five variables are used to predict the farmers’ choices: altitude, site type, macroregion, geography, and chronology. By analysing these variables with the random forest algorithm, the goal is to gain a better understanding of the factors that drive farmers’ decision-making. Random forests set to regression return the percentage of variance explained by the model and the mean of the squared residuals. The percentage of variance explained by the model is a measure of how well the model fits the data—it tells us how much of the variation in the data can be explained by the model. The mean of the squared residuals is a measure of the error of the model. In other words, it tells us how much the model’s predictions deviate from the true values in the data. A low mean of squared residuals indicates that the model’s predictions are close to the true values, while a high mean of squared residuals indicates that the model’s predictions are far from the true values.

Show the code
############################################################################
##
## RANDOM FOREST
## Research questions:
## 1. Factors influencing most the different distributions of animals in 
##    the dataset.
## 2. Factors that have the highest impact on the distribution of Cattle
##    in each chronology.
##
############################################################################

#######################
## Read the libraries
#######################

library(vegan)
library(randomForest)
library(stringr)
library(tidyverse)

############################################################################
## Read the file: Zooarch_Condensed_with_altitude.csv
## This file is the Zooarch_Condensed to which it 
## has been added a column containing the altitude.
############################################################################

data_test <- read.csv("/Users/robertoragno/Desktop/University/Bari/PhD - Quarto/Database export/Zooarch_Condensed_with_altitude.csv", header=TRUE, sep=";")

############################################################################
##
## Reformat the data so that:
## 1. The absolute counts become frequencies.
##    - Using the decostand() function in the vegan package
## 2. The columns subsetted are: 
##    - Type 'numerical': Pigs, Cattle, Caprine, Edible.W..Mammals, Altitude
##    - Type 'factor': Type, Chronology, Geo, Macroregion
## 3. In the subsetted dataframe, convert the relative counts to %s.
## 4. The column Type has too many categories. Some are merged using the 
##    str_replace() function in the stringr library.
## 5. Interpreted correctly by randomForest:
##    - The columns Type, Chronology, Geo, Macroregion have to be converted from
##      chr to factor.
## 6. NAs are converted to 0s.
##
############################################################################

# 1. Convert to frequencies
data_test[c(16:22)] <- decostand(data_test[c(16:22)], method="total", na.rm = TRUE)*100

# 2. Subsetting the dataframe
data_test <- data_test[c(16,17,18,20,4,6,7,10)]

# 3. Merge some categories in the 'Type' column
data_test$Type <- str_replace(data_test$Type, "Rural site, mansio", "Rural")      
data_test$Type <- str_replace(data_test$Type, "Religious, monastery", "Religious")
data_test$Type <- str_replace(data_test$Type, "Castle", "Fortified")
data_test$Type <- str_replace(data_test$Type, "Castrum", "Fortified")
data_test$Geo <- str_replace(data_test$Geo, "Hilltop", "Hill")

# 4a. Remove the categories 'Necropolis' and 'Religious' as they might skew
# the results.
data_test <- filter(data_test, Type!="Necropolis" & Type!="Religious")

# If I need to select a chronology:
#data_test <- filter(data_test, Chronology=="EMA")

# 5. Convert chr columns to factors.
data_test$Type <- as.factor(data_test$Type)
#data_test$Chronology <- as.factor(data_test$Chronology)
data_test$Geo <- as.factor(data_test$Geo)
data_test$Macroregion <- as.factor(data_test$Macroregion)

# 6. Convert NAs to 0s.
data_test[is.na(data_test)]<- 0

# Remove duplicate rows
data_test <- data_test[!duplicated(data_test[,-6]), ]   
library(tidyverse)
data_test <- filter(data_test, rowSums(data_test[,c(1:4)]) >0)

############################################################################
##
##  RANDOM FOREST
##  1. We set.seed() for reproducible results.
##  2. We use the function randomForest() from the randomforest package:
##     - We choose as response/target variables the four categories of animals %. 
##     - We choose as predictors the site Type, Chronology, Geography, 
##       Altitude, Macroregion.
##
############################################################################

# 1. Set seed for reproducible results
set.seed(42)

# 2. Create randomForest using the 4 numerical columns of animals as predictors
#    and 'Type' as target. 

# 3. For convenience, store the predictors names separately
#    Output: "Type", "Chronology","Geo","Altitude","Macroregion"
predictors1 <- names(data_test[5:ncol(data_test)])

# 4. Perform the randomForest and store the results in a variable called 'model'
model <- randomForest(data_test$Pigs+data_test$Caprine+data_test$Cattle+data_test$Edible.W..Mammals ~ .,
                      data = data_test[, predictors1],
                      ntree = 3000,
                      mtry = 4,
                      importance = TRUE)

The amount of variance explained by our model is not high, but it may be sufficient for our current purposes. We can use this information to conduct more detailed and focused analyses of our results. This will allow us to gain a deeper understanding of the factors that are influencing the outcome of our study. Although the predictive power of our model may not be as strong as we would like, it is still useful for helping us to explore the relationship between the variables we have considered.


Call:
 randomForest(formula = data_test$Pigs + data_test$Caprine + data_test$Cattle +      data_test$Edible.W..Mammals ~ ., data = data_test[, predictors1],      ntree = 3000, mtry = 4, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 3000
No. of variables tried at each split: 4

          Mean of squared residuals: 173.3055
                    % Var explained: 21.79

It is possible to use the importance() function in the randomforest package to identify the key variables that are most influential in predicting the outcome of a zooarchaeological study. In this case, altitude was found to be the most important variable, followed by macroregion and type/geography. Chronology was also found to be important, but not to the same extent as the other factors. It is important to note that this may be due to the fact that some entries in the dataset cover multiple chronologies. To address this issue, the analysis can be repeated using individual chronologies. This will provide a more accurate understanding of the factors that influence the farmers’ choices of domestic animals in the first millennium CE.

8.6 RF 2:

A random forest analysis was performed using only cattle as the response variable and the other columns as predictors. The resulting model did not explain a high percentage of the variance. However, the analysis did reveal that Altitude and Macroregion are the most important factor in determining the distribution of cattle in Italian sites. This suggests that geographical location may be a significant factor in the distribution of domestic animals. By focusing on these variables, it is possible to gain a better understanding of the patterns of animal husbandry in the first millennium CE.

Show the code
# Random forest 2

# Specify response and predictor variables
response2 <- "Cattle"
predictors2 <- names(data_test[5:ncol(data_test)])

# Fit random forest model
model2 <- randomForest(data_test[, response2] ~ .,
                      data = data_test[, predictors2],
                      ntree = 4000,
                      mtry = 4,
                      importance = TRUE)

model2

Call:
 randomForest(formula = data_test[, response2] ~ ., data = data_test[,      predictors2], ntree = 4000, mtry = 4, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 4000
No. of variables tried at each split: 4

          Mean of squared residuals: 256.1973
                    % Var explained: 28.07
Show the code
varImpPlot(model2, scale=F)